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Abstract. The classical attenuation regulation of gene expression in 
bacteria is considered. We propose to represent the secondary RNA struc- 
ture in the leader region of a gene or an operon by a term, and we give a 
probabilistic term rewriting system modeling the whole process of such 
a regulation. 

1 Introduction 

Modeling the mechanisms of regulation of gene expression, allowing prediction 
of quantitative characteristics of this expression (such as estimation of the level 
of expression and concentration of the substrate) is an important research chal- 
lenge. In a previous work [LRSP06,LPRS07], a model of one particular kind 
of regulation, the classical attenuation regulation, has been suggested. In that 
model, the evolution of the secondary RNA structure in the leader region of a 
gene, and the progress of the ribosome and the polymerase along the RNA/DNA 
strands, are represented by a very special, elaborated in detail, Markov chain. 
In this chain the transition probability corresponding to the progress of the ri- 
bosome depends on a "control variable" — the concentration of charged tRNA 
molecules in the cell. All the other probabilities do not depend on the control 
variable, they can be determined from energy-based considerations. Termination 
and antitermination (of gene expression) correspond to particular random events 
in the Markov chain. In [LRSP06] , a Monte-Carlo simulation of this Markov chain 
led to biologically realistic dependence of termination probability from the con- 
trol variable. Due to a large size and a complex structure of the Markov chain, 
its simulation is a heavy computational task, but it was successfully solved, and 
a software tool called Rnamodel simulates one trajectory in fractions of a sec- 
ond [LRSP06,RNA]. However, the approach based on the direct description of 
the Markov chain and its simulation has some limitations, especially for a the- 
oretical analysis. Biologically, it would be nice to have a more structured and 
compact representation of the Markov chain and its instantaneous probability 

* The support of CNRS-RAS cooperation agreement 19122 Evolver is gratefully 
acknowledged. 



distributions over all states at every instant, or only for sufficiently large time, 
or only probabilities of the two biologically important events — termination and 
antitermination. 

Note that the problem of modeling the classical attenuation regulation, as 
stated in [LRSP06] and in the current article, is related to the representation 
of the transient behavior of the secondary structure on a sliding window on 
the RNA strand between the ribosome and the polymerase (see below for de- 
tails) . This differs from the kinetics of the secondary RNA structure on a fixed 
nucleotide sequence for unlimited time, i.e. unlimited number of steps, investi- 
gated in many papers. The structure that appears after a large amount of time 
is called equilibrium secondary RNA structure, it corresponds to a minimum of 
energy, see e.g. [Zuk03,FFHS00]. The tool Rnamodel has also the function of 
determining this equilibrium structure and its energy as a special part of the full 
model in [LRSP06]. However, real structures that appear on the RNA strand 
during the regulation process are far from the equilibrium and their energies are 
far from minimal. 

In this article we discover a regular internal structure of the Markov chain de- 
scribing the classical attenuation regulation. We show that it can be represented 
as a probabilistic term rewriting system for a particular type of terms. The set 
of rewriting rules can be large, but all of them are generated by a small set of 
(five) metarules. In fact we give the full description of the metarules and explain 
how to generate all the rules for the case of classical attenuation regulation. 

Potential benefits of such a representation are multiple: 

— easier and more precise modeling of regulation mechanisms depending on 
the dynamics of the secondary structure; 

— compact description of such mechanisms, perhaps in dedicated languages, 
and hence a better biological understanding of regulation processes; 

— convenient representation of secondary structures by terms; 

— specific analysis and simulation methods for rewriting systems. 

This article is structured as follows. In section 2 we describe shortly the 
biological phenomenon that we want to model: the mechanism of classical atten- 
uation regulation (CAR). In section 3 we introduce a class of terms and proba- 
bilistic term rewriting systems. In section 4 we represent a qualitative metamodel 
of the biological mechanism of CAR by a term rewriting system. In section 5 we 
refine the previous system and decorate its transitions with rates, thus obtaining 
a representation of the Markov chain by a probabilistic term rewriting system. In 
section 6 we show some simulation results. In section 7 we discuss some related 
work on term rewriting and its applications. In section 8 we conclude with a dis- 
cussion of perspectives of the rewriting approach to modeling the mechanisms 
involving RNA secondary structures, especially regulation. 

2 Classical attenuation regulation 

To begin with, we recall some well-known biological facts about the biological 
phenomenon playing the central role in this article. 



The expression of a group of structural genes (that is synthesis of the cor- 
responding proteins, which are ferments for a chemical reaction) can be regu- 
lated by a sequence of nucleotides placed on the DNA upstream inside the so 
called leader region of the genes [SB91]. This subsequence of the leader region 
is called the regulatory region. In this article we deal with one particular type of 
regulation, classical attenuation regulation (CAR) in bacteria. This regulation 
mechanism concerns structural genes (groups of genes — operons) that produce 
proteins which catalyze the synthesis of amino acids. The classical attenuation 
allows to activate such an operon when the cell contains a small concentration of 
the amino acid, to deactivate the operon whenever this concentration increases, 
and to do it fast. The mechanism of CAR involves several actors: the regulatory 
region on the DNA, its copy on the RNA, the ribosome, and a ferment called 
RNA polymerase (see Fig.l). 




Fig. 1. Classical attenuation regulation. The RNA polymerase Pol transcribes the regu- 
latory region Q, the ribosome Rib translates the leader peptide gene Q' . The movement 
of Rib on regulatory codons Q" is controlled by the concentration of charged tRNA. 
The secondary RNA structure uj between Rib and Pol brakes Pol and pushes it off 
the chain. If Pol reaches the structural genes, then they are expressed, i.e. transcribed 
and then translated. Note that in both the DNA and the RNA, we use Q, Q' and Q" 
to denote the regulatory region, the leader peptide gene, and the regulatory codons, 
respectively. 



For structural genes to be expressed two concurrent processes should suc- 
ceed: the regulatory region Q should be transcribed creating an RNA by RNA 
polymerase. At the same time the ribosome should be bound to the very begin- 
ning of the freshly created segment Q' (called the leader peptide gene) in the 
regulatory region Q on the RNA and starts translation of this leader peptide 
gene to an auxiliary protein. The essential part of the regulation process takes 
place when the ribosome moves on Q' on the RNA and the polymerase moves 
somewhere downstream of the ribosome on Q on the DNA. 

The ribosome moves "rightwards" (formally speaking, in the direction from 
the 5' to the 3' end) on a segment Q' of the sequence Q. Its speed is constant 



except on a subsequence Q" {regulatory codons) where it depends directly on the 
concentration of the amino acid (via charged tRNA concentration) . To the right 
of the ribosome and independently of it, the polymerase moves rightwards on Q. 
Between the ribosome and the polymerase a secondary structure w is formed on 
the RNA. This structure consists in pairing of some nucleotides, and it changes 
very fast. An important effect of the secondary structure u consists in slowing 
down the movement of the polymerase. There are two possible scenarios: 

— When Lo is strong enough, its "braking" action on the polymerase increases, 
and moreover, the polymerase can slip off the DNA (this can only happen 
on so-called T-rich sequence, where the connection of the polymerase and 
the DNA weakens). Such an event is called termination, and in this case 
the structural genes arc not expressed: the transcription of the regulatory 
region is aborted, the structural genes are not transcribed and therefore not 
translated. 

— Another possibility is that the ribosome moves fast enough to weaken or 
partly destroy most of the structure lo. In this case the polymerase safely 
traverses the T-rich sequence, and arrives to the end of the leader region Q. 
Next, the polymerase enters the structural genes, and their transcription, 
followed by translation are unavoidable. This event is called antitermination 
and in this case the structural genes are expressed. 

In the rest of this article we build a qualitative and a quantitative models of 
the regulation process described above. 

3 Terms and rewriting systems 
3.1 Unranked unordered terms 

Let U be a finite set of function symbols and X an enumerable set of variables 
(standing for sets of terms). The set Ti;[A'] of terms over S and X is the smallest 

set that satisfies: 

— SCTs[X], 

— {f{x)\f€SAxeX}CTs[X], 

— if f G S and s C Ti;[A'] is a set of terms, then f{s) is in Ti;[A']. 

By definition we also put /(0) = / for f & S. For convenience we write 
f{g,h{e)) instead of f{{9,h{{e})}). However one should remember that the 
coma-separated terms are unordered. 

Example 1. Let S = {e,f,g,h} and X = {x, y, z, . . . }, then the foUowings are 
terms in Tj:[X]: fig, He)), f{f{x)) and e{g,f). 

Note that we consider function symbols of variable arity. stands for Tx'[0]. 

Terms in Ts are called ground term,s. Variables arc used only to define substi- 
tution and rewriting rules. The "real" terms are ground terms. A substitution a 
is a mapping from X to 2'^^^'^\ written as a = {xi Ti, . . . ,Xn Tn}, where 



Ti, 1 < i < n, is a finite set of terms that substitutes the variable Xi. The term 
obtained by applying the substitution cr to a term t is written ta. We call it an 
instance of t. 

Let i? be a rule of the form I ^ r, where / and r are terms in Tx'[A']. For 
ground terms t, t' we write t -^r t' if there exists a substitution a such that 
t' can be obtained from t by replacing an occurrence of the subterm la by ra. 
-^R defines a relation between ground terms. Let be the reflexive transitive 
closure of -^r. 

Example 2. Let R = 1 ^ r with I = f{x,e), r = f{g{x),e) and t = e{f{h,e)), 
then t -^R t' where t' = e{f{g{h), e)). 

A term rewriting system (TRS) is a finite set of rules of the form / r. Given 
a TRS TZ and a set of terms I CTs, the language TZ*{I) is defined as the set of 
all ground terms that can be obtained from the terms in I by applying a finite 
number of times the rules from Tl, i.e., TV {I) = g Tj; | 3t' G I,t' t). 

Example 3. Let Tl = {f{x) —>■ g{f{x))} and I = {/(e, h)}, then 

n*{I) = {g^{f{e,h))\n€TN}. 

3.2 Probabilistic Term Rewriting Systems 

A Continuous Time Markov Chain is a pair (S*, p), where S* is a finite or enu- 
merable set of states and p : S x S ^ [0, oo) is the rate matrix. For s, s' S S, 
p{s, s') > means that there is a transition between states s and s', and that the 
probability for moving from s to s' within t time units is equal to 1 — e^''*^^'" 
If a state s has more than one outgoing transition (i.e., if there exist more than 
one state s' for which p{s,s') > 0) there exists a race between these transi- 
tions and the probability for moving from s to s' within t time units is equal to 

^(1 - e'^i^y*), where E(s) = ^ P(s,s'). 

s'es 

A (continuous time) Probabilistic term rewriting system (PTES) over SUX 
is a (finite) set of rules of the form l-^r, where / and r are terms in 
and A G (0, oo) is a rate. 

A PTRS TZ over S IJ X defines a continuous time Markov chain on ground 
terms M = {Tz:,p), where p{t,t') = A iff there exists a rule l-^r G TZ such 
that t — >i{ t', where R is the "non probabilistic" rule / — > r. 

Remark 1. If there are several rules (or several instances of the same rule) that 
lead from t to t' , then p{t, t') = ^A, where the sum is taken over all such rules 
or instances. 

4 Metamodel 

We want to model the phenomenon of the classical attenuation regulation de- 
scribed in section 2. 



We suppose that a regulatory region Q (see Fig. 1) is given and fixed in the 
sequel, it is a sequence (word) Q G {A, C,G,T}*, the letters of this alphabet 
are called nucleotides. Wc denote by the length of any word x and Xi the ith 
letter of x, so x = xiX2 ■ ■ ■ x\x\- The sequence Q can be folded^ in a way that 
some nucleotides of Q are paired: A with T and C with G. The complement of 
a nucleotide is written using a bar: A = T, T = A, C = G, G = C. We look in 
Q for subwords ( "stems" ) of the form 

QaQa+1 ■■■Qb and QcQc+i ■ ■ ■ Qd such that 

B-A = D-C, A + 3<B, B + 3<C (1) 

Qa = Qd, Qa+1 = Qd-1, ■ ■ ■ Qb = Qc ■ 

Any pair of such stems forms a hypohelix (see Figure 2, where the labels Ai,Bi, d 
and Di are positions in the word Q). 




Fig. 2. One hypohelix /. 



We describe a hypohelix / by a tuple of its stems' extremities / = {A, B, C, D), 
and we introduce the following notations: 

stemif) = [A, B] U [C, D], loop{f) = [B + 1,C-1], supp{f) = [A D]. 

There is a ribosome at some position on Q' and an RNA polymerase some- 
where to the right of it. Both move to the right, in one step the ribosome moves 
by three successive nucleotides and the polymerase by one nucleotide. The win- 
dow w = (R, P) represents the segment of RNA from the first position R after 
the end of the ribosome to the last position P before the beginning of the poly- 
merase. In fact the folding of the RNA sequence Q can only happen within the 
current window, i.e. between positions R and P. When the ribosome advances 
to the right, it can destroy the leftmost hypohelix of a current configuration, 
because it consumes the first three letters of the window. On the other hand any 
polymerase move adds one new letter to the window. 

^ only on its "active" part called window, as we will see below 



Formally a window has the form w = {R, P) with i?, P € IN. The following 
constraints should be satisfied: 

13 < i? < P < IQI (2) 
Thus, the window is moving and changing its length. 

Let W = {w = {R, P) I conditions (2) are satisfied} be the alphabet of all 
windows. We define 

stem{w) = 0, loop{w) = [R,P], supp{w) = [R,P] ■ 

We will write terms over the alphabet S of all hypohelices and all windows: 

S ^ HVJW where H ^ {f = {A, B, C, D) \ conditions (1) are satisfied}. 

We consider only terms of the form w{. . .) for some w & W (rooted by some 
window w). According to the conditions that we will define next, a symbol 
/ = {A, B, C, D) can appear in a term w{. . . ) only ii R < A and D < P, where 
w = {R,P). 

We say that a hypohelix / is embedded in g (which can be a hypohelix or 
a window), written f ~< g, ii supp{f) C loop{g). Two hypohelices / and g are 
disjoint, written / cxi 5, if supp{f) fl supp{g) = 0. We call / and g unknotted if 
either one of them is embedded in the other or they are disjoint. We say that 
g = (^2, -B2, C2, -D2) is an extension of / = (Ai, Pi, Ci, Di), denoted / □ g, if 
[^1, Pi] C [A2, P2] and P2 - Pi = Ci - C2, hence [Ci, Pi] C [C2, P2], and the 
pairing in g is an extension of that in /. See Figure 3. 

We call a term t over S well-formed if it satisfies the following conditions: 

(compatibility) any / and g appearing in t are unknotted, in particular any 

/ can appear at most once, 
(ordering) if / and g occur in t, then / -< g iS f is in the scope of g. 

The combination of two hypohelices in Figure 4 is biologically feasible, but 
according to our rules these hypohelices are incompatible. We believe that this 
restriction (crucial for representation by terms) does not undermine significantly 
the accuracy of the model. 

Notice, that a well-formed term of the form w{. . . ) (rooted by some window 
w) contains only hypohelices from 

This simple observation greatly simplifies the simulation process. 

In [LRSP06] an additional m,aximality condition is imposed. Using the ter- 
minology of this article, it requires that no hypohelix fint can be replaced by 
its proper extension without creating an overlapping. Here we do not impose 
this restriction. 

Each well-formed term represents a possible secondary RNA structure in a 
window in Q: the set of hypohelices that are present in this window. It could be 




Fig. 3. Relative positions of two hypoiiclices / and g: f -< g and f \xi g. Here / = 
{Ai,Bi,Ci,Di) and g = {A2, B2,C2, D2). On the left B2 < Ai and Di < C2, on the 
right D2 < A2. 




Fig. 4. Pseudo-knot: Ai < Bi < A2 < B2 < Ci < Di < C2 < D2. Such configurations 
are not allowed in our model. 



possible to allow knotted hypohelices, and hypohelices of length less than 3, but 
here we do not consider them. 

We extend the definitions of ixi and ^: let / be a term and c a set of terms, 

cixi f iS\/g € cigtxi f), 
c~<fiSygec{g~< /). 

In the former case we say that / and c are disjoint, in the latter that c is 
embedded into /. 

We start from a sequence Q without any pairing of nucleotides, this structure 

is described by a term w{) — "an empty window", where w = (13, 13). Our aim 
is to represent the evolution of the secondary structure in the window, as well 
as the progress of the ribosome and the polymerase, through rewriting terms 
starting from w{). Our rewriting system will generate only well-formed terms. 

On the whole, there are five rewriting Meta-rules: 

— Binding and decomposition of a hypohclix /: 

(w = 5(c,d)) < — ^ (w' = 5(c,/(d))) withccx/, d-</, (3) 

where c and d are sequences of terms. The concrete rewriting rules — and 
their rates — depend on c and d, as explained below. 

— Extension and reduction of a hypohelix 

{LU = f)^ {lu' = g) with frg. (4) 

— The window movement can be described by the following rules, where w = 

{R,py. 



{R,P){oj)^{R + 3,P){uj') , (5) 
iR,P){Lj) ^ {R,P + l){u;) , (6) 
w{cj) ^ ± . (7) 

In the last rule, _L is a special symbol denoting termination. Rules (5) describe 
the movement of the ribosome. In these rules, lo' is obtained from lo by removing 
only the possible symbol that is incompatible with the new window {R + 3, P), 
or replacing it by a "shorter" hypohelix. Indeed, if the leftmost hypohelix in w 
starts at a position between R and i? + 3, then the movement of the ribosome 
by three positions to the right will destroy this hypohelix. More formally, if 
u) -< {R + 3,P), then ui' = ui. Otherwise the ribosome destroys the leftmost 
hypohclix. In this case, there is a single symbol / in uj such that f yi. {R + 3, P). 
Suppose the subterm rooted by / is /(c). Then, oj' is obtained by replacing in 
u) f{c) by either /'(c) or c, depending on the size of /, where /' C /. 

Rules 6 describe the movement of the polymerase. Note that if the polymerase 
reaches a position P+1 where the structural genes are expressed, then we reach 
antitermination and the gene is expressed. 



5 Quantitative model 



Now, we introduce the rates of the five rewriting rules. 

Let . ■ ■ , fn{*)) be a term. Then the free loop length of the hypohehx 

h in this term is 

n 

Ih = I loop{h)\ - ^ I supp{fi)\ . 

This numeric characteristic corresponds to the number of nucleotides in the loop 
of the hypohelix h that do not participate in inner hypohelices. 

In order to define the rate, we have to consider the concrete rule corre- 
sponding to the Metarule (3). For any f,g,c = ci{xi), . . . ,Cm{xm) and d = 
rfi(yi), . . . , dniUn) such that cixi/, d -< f, f < g there is a concrete rule 

{lo = 5f(ci(a;i), . . . , Cm{xm), di{yi), 
i — > {^^' = g{ci{xi),...,Cm{xm), f{di{yi),...,dn{yn)))) (8) 

Recall that the subterms are unordered. Similarly the concrete rule correspond- 
ing to (4) is 

(w = a{ci{xi), Cm{xm), f{di{yi), dn{yn)))) 
< — > {u' = a{ci{xi),...,Cm{xm),9{di{yi),...,dn{yn)))) (9) 

Note that this transformation can change the free loop length of the hypohelix 
a. The rate of the rules (8-9) is denoted K{u} — *■ co'), given by 

K{w ^u') = K- exp Q iE{u) - E{io'))j , (10) 

where the energy E{u) = Ghei{yi)-\-Gioop{yi), k is a parameter — usually k = 10^ 
— and 




and h varies over all hypohelices from u). Eh represents the total stacking energy 
along the hypohehx h. It is the sum of stacking bond energies of the adjacent base 
pairs of h. B can take three different values depending on the three possible types 
of the loop of the hypohelix g: terminal loop, single-strand bulge and double- 
strand bulge. 

A codon is a triple of successive nucleotides. For a sequence Q', each codon 
is fixed to be either regulatory or non-regulatory. Analogously, each nucleotide 
in Q is fixed to be cither non T-rich or T-rich [LRSP06]. Let sq be the "radius" 
of a ribosome — distance from P-site to the end of the ribosome — usually 
So = 12, and let s\ be the "radius" of a polymerase — distance from the 5' 
end of a polymerase to its transcription center — usually si = 9. The rate of 
the rule (5) is denoted Xrib and is constant when R — sq is & position of a non- 
regulatory codon, and otherwise X^b depends on an external parameter c — the 



concentration of charged tRNA [SB91]. The rate of the rule (6) is denoted v and 
depends on secondary structure w in the window. The rule (7) applies only when 
P + si is a position of a T-rich nucleotide and its rate is denoted [i. 
In [LRSP06] the rate of the rule (5) was denoted Xrih and 

Ah6(c) = ^ . (12) 

The rate of the rule (6) was denoted v and 

1/ = 40 - P(a;) . (13) 
The rate of the rule (7) was denoted [i and 

M = \f{^) ■ (14) 

The function F{lo) in (13-14) for w = /i(*), . . . , /«(*) depends only on functional 
symbols (hypohelices) /i, . . . , /„, and not on the structure of their arguments 
denoted by *. More precisely P(w) = maxj F{fi), where 



5.exp(-im) 



{L2y ■ {p{f) -Por + 1 

with p{f) « I supp(f) \ ' ^(•^) ^^'^ "free distance" from / to the end P of the 
window: for / = {A, B, C, D) and w — (i?, P), wc have 

r{f) = R-D-Y,\supp{h)\. (16) 

i 

Other symbols in equation (15) denote constants: tq = 1,8 = 30, L2 = 27.1, po = 
0.18, see [LRSP06]. 

Note that the rates of the rules depend only on the local configuration as 
explained above and not on the outside context. In particular it does not depend 
on instantiations of a;i, . . . , Xm, yi, - ■ ■ ,yn- 

6 Simulation results 



ATGAAAGCAATTTTCGTACTGAAAGGTTGGTGGCGCACTTCCTGAAACGGGCAGTGT 
ATTCACCATGCGTAAAGCAATCAGATACCCAGCCCGCCTAATGAGCGGGCTTTTTTTTG 



Fig. 5. A regulatory region for trpE genes in E. coli. 



We have adapted the simulator described in [LRSP06] and available at [RNA] 
to obtain sequences of terms. As an example in Figure 6 we give one (slightly 
shortened and simplified) terminating trajectory of the regulation process for 
the trpE genes (responsible for the synthesis of tryptophan) in E. coli. The 
regulatory region itself is presented in Figure 5. 



7 Related Work 



References to the literature on RNA regulation mechanisms can be found in 
[LRSP06,LPRS07]. 

Term rewriting systems have been used in the so called Regular Model Check- 
ing framework [KMM+01,BT02,AJMd02,ALdR05]. They have been successfully 
applied to the analysis of parameterized systems [BT02,AJMd02,ALdR05] and 
multithreaded programs [BT02,BT03,Tou05]. However, in the regular model 
checking framework, the rewriting rules are not probabilistic. This work consti- 
tutes the first step towards the extension of the regular model checking frame- 
work with probabilistic rewriting rules. This would allow for example the anal- 
ysis of probabilistic parameterized systems and probabilistic multithreaded pro- 
grams. 

Rewriting systems have also been used in articles [BIK06,BCC"'"03] to model 
chemical reactions. Compared to our work, the rewriting systems considered 
in [BIK06,BCC+03] are not probabilistic. Moreover, these works consider the 
modeling of chemical reactions whereas we consider modeling of RNA secondary 

structure. 

Finally, probabilistic term rewriting systems have also been considered in 
[BH03,BK02,KSMA03]. But in these works, the symbols are of fixed arities and 
the terms are ordered, whereas in our framework, the symbols have arbitrary 
arities and the terms are not ordered. Moreover, as far as we know, this is the 
first time that probabilistic term rewriting systems are used to model attenuation 
regulation. 

8 Conclusions and perspectives 

We have established that the framework of probabilistic term rewriting systems 
provides compact and structured description of detailed models of RNA regula- 
tion. 

We intend to continue exploration of this framework. The most important 
task consists in the development of adequate data structures and algorithms, 
as well as approximation and abstraction methods for analysis of this kind of 
models. The next step would be a massive computational experimentation, the 
biological interpretation of results and validation of results by real biological 
data. 
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Fig. 6. A simulation result: one typical terminating trajectory for classical attenua- 
tion regulation of trpE genes in E. coli. Notations: — » means one rewriting; — > means 
several similar rcwritings; repeated window positions (e.g. repetitions of (40, 51))are 
replaced by a • symbol; _L means termination. There are 24 helices, denoted by letters 
from a to a;. 



